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The analysis of data arising from environmental health studies 
which collect a large number of measures of exposure can benefit 
from using latent variable models to summarize exposure informa- 
tion. However, difflculties with estimation of model parameters may 
arise since existing fitting procedures for linear latent variable mod- 
els require correctly specified residual variance structures for unbiased 
estimation of regression parameters quantifying the association be- 
tween (latent) exposure and health outcomes. We propose an estimat- 
ing equations approach for latent exposure models with longitudinal 
health outcomes which is robust to misspecification of the outcome 
variance. We show that compared to maximum likelihood, the loss of 
efflciency of the proposed method is relatively small when the model 
is correctly specified. The proposed equations formalize the ad-hoc 
regression on factor scores procedure, and generalize regression cal- 
ibration. We propose two weighting schemes for the equations, and 
compare their efficiency. We apply this method to a study of the 
effects of in-utero lead exposure on child development. 



1. Introduction. The association between child lead exposure and neu- 
rodevelopment has been widely studied. Initially research focused on de- 
scribing the cross-sectional relationship between lead exposure measured by 
child's blood lead concentration and mental development [e.g., Fulton et 
al. (1987), Hatzakis et al. (1989)], and later on the association between de- 
velopment and prenatal exposure measured by the concentration of lead in 
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umbilical cord blood [e.g., Bellinger (1989)]. More recently, lead concentra- 
tions in the maternal skeleton and in plasma (a component of whole blood) 
emerged as novel biomarkers of prenatal exposure, and studies have shown 
inverse associations with mental developmental at 24 months [e.g., Gomaa 
et al. (2002); Hu et al. (2006)]. Currently, one focus of this area of research 
is on the time windows during pregnancy when the developing fetus is more 
vulnerable to exposure [e.g., Schnaas et al. (2006)]. 

Studying the issue of time windows of vulnerability to fetal lead expo- 
sure is difficult since direct measures of fetal exposure are not feasible. Cord 
blood lead levels have been used as a measure of fetal exposure, but only 
reflect the third trimester of exposure because the half-life of lead in blood 
is approximately thirty to 45 days [Hu et al. (1998)]. Exposures earlier dur- 
ing pregnancy, for example, the first trimester, may be more important as 
the developing brain may be more susceptible to neurotoxicants during this 
period [Mendola et al. (2002)]. 

1.1. ELEMENT study in Mexico City. The Early Life Exposures in Mex- 
ico City to Neuro- Toxicants (ELEMENT) study recruited prospective moth- 
ers, at or before conception, to address the question of windows of vulner- 
ability to lead exposure. Women were followed during pregnancy to assess 
their exposure to lead. Their children were followed after birth to assess their 
development and lead exposure using, respectively, the mental development 
index (MDI) of the Bayley's scale of mental development [Bayley (1993)] 
and blood lead levels at 3, 12, 18, 24, 30 and 36 months of age. Various 
measures of exposure (lead concentrations in whole blood and plasma) were 
collected on the mother during each trimester of pregnancy and at 1 month 
postpartum; for a small group of mothers, all this information was also cap- 
tured before conception. Measures of lead concentration in blood and plasma 
during pregnancy are the closest surrogate measures of fetal exposure. Lead 
stored in maternal bone leaches out into blood and is thus considered an 
important source of exposure to the fetus, and an important predictor of 
plasma and blood lead levels. Another predictor of blood and plasma levels 
is the rate at which bone resorbs (natural bone remodeling process). Urinary 
cross- linked N-telopeptive (NTx), a measure of bone resorption in units of 
bone collagen equivalents, was also collected. Other information, such as ma- 
ternal age and IQ, and the use of lead-contaminated ceramics (days/week) 
was also collected. Table 1 lists sample characteristics for the outcome as well 
as covariates, and thirteen exposure surrogates; the sample consists of 341 
mother-child pairs. To be included in the analysis we present, the mother 
had to have measurements on at least one of the surrogate measurements 
of fetal exposure, and have completed an IQ test. The children in the sam- 
ple completed at least one of six assessments of Bayley's MDI, and had a 
concurrent blood lead measurement at the time of each MDI assessment. 
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Table 1 



ELEMENT study 


m Mexico City: 


sample c 


■haracteristics 






Time 


N 


Mean 


StDev 


Prenatal exposure surrogates 










Mother's logj Plasma lead 


BP 


11 


4.1 


1.05 




Tl 


153 


3.8 


0.96 




T2 


169 


3.4 


0.86 




T3 


157 


3.5 


0.79 


Mother's whole blood lead 










Laboratory 1 


BP 


11 


8.8 


9.0 




Tl 


155 


6.9 


4.7 




T2 


173 


6.2 


3.1 




T3 


159 


6.7 


3.4 


Laboratory 2 


BP 


29 


8.0 


5.8 




Tl 


172 


7.6 


4.6 




T2 


198 


6.6 


3.2 




T3 


304 


6.9 


3.6 


Child's logj Cord blood lead 


Birth 


238 


2.1 


0.9 


JT (Jb h lid, L cLi fcJAjJ \ja Lli c 










Child's blood lead 


3mpp 


300 


3.8 


1.9 




12mpp 


298 


4.9 


2.9 




18mpp 


321 


6.8 


3.6 




24mpp 


318 


4.8 


3.4 




30mpp 


248 


6.5 


3.6 




36mpp 


280 


6.9 


3.7 


Mealtli outcome 










Child's MDI 


3mpp 


323 


94.2 


5.7 




12mpp 


323 


95.4 


9.1 




ISmpp 


321 


91.1 


8.6 




24mpp 


318 


91.4 


11.3 




30mpp 


248 


92.9 


8.5 




36mpp 


280 


94.1 


8.5 


Mother's bone measures 










Tibia lead 


BP 


19 


9.4 


12.1 


Patella lead 


BP 


23 


13.2 


10.9 


Tibia lead 


Impp 


279 


7.8 


9.6 


Patella lead 


Impp 


335 


10.7 


10.7 


NTx 


Tl 


95 


6.1 


0.74 




T2 


127 


6.5 


0.75 




T3 


137 


7.0 


0.59 



Details on the laboratory procedures to obtain these measures are re- 
ported elsewhere [e.g., LaMadrid-Figueroa et al. (2006); Tellez-Rojo et al. 
(2004)]. Briefly, bone concentrations {^g Pb/g bone mineral) are obtained 
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Table 1 
( Continued) 



Time 



N 



Mean 



StDev 



Covariates 
Mother's IQ 
Age 

% of life in Mexico City 
Weekly ceramics use 
Child's gender 



24mpp 

Scr 

Scr 

Scr 
Birth 



341 
341 
341 
341 
341 



89.6 
25.9 
89.9 
0.60 
0.5 



18.6 
5.1 

27.3 
1.55 



Abbreviations: BP : 
Scr = Screening. 



; Before Pregnancy; Tj = Trimester j; 2:mpp = x months post partuni; 



with a K-X ray instrument (similar to a regular X-ray, but emits lower radi- 
ation). A urine sample and two blood samples were taken at each prenatal 
visit. One blood sample was sent to "Laboratory 1," where whole blood lead 
concentration was measured (/^g Pb/dL). The other blood sample and the 
urine were sent to "Laboratory 2." The samples were processed to obtain 
the concentration of lead in plasma (/ig Pb/dL) as well as a second measure 
of lead in whole blood, and the urinary NTx measure, respectively. 



Table 2 

Estimated associations^ between mental development and prenatal and postnatal lead 
exposure, using available surrogates for prenatal exposure 



Prenatal exposure Postnatal exposure 



Surrogate N /3i s.e. p-value /32 s.e. p-value 



log2 Plasma lead t=\ 


153 


-0, 


.423 


0, 


.482 


0, 


.380 


-0. 


,111 


0, 


144 


0, 


,442 


logj Plasma lead t=2 


169 


-0, 


.623 


0, 


.509 


0, 


,221 


-0. 


,073 


0, 


,145 


0, 


,616 


log2 Plasma lead t=3 


157 


-0, 


.429 


0, 


.516 


0, 


,406 


-0. 


,122 


0, 


145 


0, 


,403 


Mother's whole blood lead 




























Laboratory It^i 


155 


-0, 


.834 


0, 


.366 


0, 


,023 


-0. 


,093 


0, 


140 


0, 


,508 


Laboratory lt=2 


173 


-1, 


.201 


0, 


.417 


0, 


,004 


-0. 


,023 


0, 


141 


0, 


,872 


Laboratory lt=3 


159 


-0, 


.960 


0, 


.494 


0, 


,052 


-0. 


,063 


0, 


145 


0, 


,664 


Laboratory 2t=i 


172 


-0, 


.935 


0, 


.405 


0, 


,021 


-0. 


,111 


0, 


139 


0, 


,426 


Laboratory 2t=2 


198 


-1, 


.048 


0, 


.377 


0, 


,005 


-0. 


,041 


0, 


,136 


0, 


,762 


Laboratory 2t^z 


304 


-0, 


.485 


0, 


.324 


0, 


,135 


-0. 


,162 


0, 


113 


0, 


,154 


log2 Cord blood lead 


238 


-0, 


.223 


0, 


,422 


0, 


,598 


-0. 


,312 


0, 


,124 


0, 


,012 



"From regression models for longitudinal data estimated with generalized estimating equa- 
tions, assuming exchangeable correlation structure, and adjusted for maternal age and IQ, 
child's gender, child's age using indicator variables for each time point, child's blood lead 
concentration and gender by time interactions. 
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One of primary goals of the study was to obtain an unbiased measure of as- 
sociation between development and fetal exposure during the first trimester 
of pregnancy, while adjusting for postnatal (childhood) exposures. Typically, 
regression analysis is used to describe such association, while adjusting for 
other factors such as maternal age and IQ, and child's blood lead concen- 
tration at the time of the MDI measurement. Because fetal exposure is not 
directly observed, any of up to thirteen lead concentrations could serve as 
a marker of this exposure. Table 2 gives the regression coefficient for each 
of the prenatal exposure surrogates. For ease of comparison across multiple 
surrogates, the coefficients are expressed in units of log2-plasma lead during 
the first trimester (i.e., we first standardize each surrogate to have mean 
zero and unit variance, multiply by the standard deviation of first trimester 
log2-plasma lead, and estimate the model using the transformed surrogate). 

Interpreting the results in Table 2 gives rise to various statistical con- 
cerns. One concern is multiple testing, since an immediate attempt at in- 
terpretation would be to compare the significance of each coefficient against 
a predetermined significance level. Using a Bonferroni correction to correct 
for multiple testing would lead to no "significant" findings. Another concern 
is that the models use only available cases, such that not all regressions 
include the same mother-child pairs. However, mothers with available data 
for these regressions are older, more likely to use contaminated ceramics, 
and have higher IQ. Further, the regression coefficients may be biased since 
this approach does not account for measurement error in the surrogate mea- 
sures of exposure. Last, including more than one (or at most a few) prenatal 
exposure biomarkers in the same model leads to problems with collinearity, 
as the correlations among prenatal exposure biomarkers are high, ranging 
from 0.52 to 0.9. Thus, although this data collection provides a wealth of 
exposure information, it also gives rise to methodological issues when ana- 
lyzing the association between exposure and health outcomes. 

1.2. Latent variables: a framework for modeling complex exposure data 
and implications for ELEMENT study. Latent variable modeling, which 
has been extensively used in social science research [Bollen (1989)], is an 
increasingly popular analysis tool in medical research and environmental 
epidemiology studies involving complex exposure data [Budtz-j0rgensen et 
al. (2002); Nikolov et al. (2006); Rabe-Hesketh and Skrondal (2008)]. La- 
tent variable models provide a framework to reduce the dimensionality of 
the exposure data and incorporate information regarding the underlying bi- 
ological relationships between the exposure measurements. In this modeling 
framework, multiple measures of an exposure can be viewed as surrogates 
for a true but unobserved exposure, reducing the dimensionality of predictor 
variables. In the in-utero lead study, the various measured lead concentra- 
tions can be viewed as characterizing latent fetal exposure that changes with 



6 



B. N. SANCHEZ, E. BUDTZ-J0RGENSEN AND L. M. RYAN 



time [e.g., Roy and Lin (2000)]. The child's blood lead concentrations can 
be viewed as indirect measures of an overall postnatal exposure. In the next 
section we propose a latent variable model which succinctly describes the as- 
sociation between mental development and the in-utero and post-natal lead 
exposure data, and is more parsimonious compared to multiple regression 
analysis. 

Although the advantages of latent variable models are clear, some of the 
disadvantages include their lack of robustness to model misspecification and 
some disagreements about parameter estimation. Latent variable models 
are susceptible to the assumptions imposed on the covariance structure of 
residual errors. That is, incorrect variance specification leads to biased esti- 
mation of the parameters of primary interest, namely those describing the 
association between the exposure and outcome [Reddy (1992); Hoogland 
and Boomsma (1998); Sammel and Ryan (2002)]. In the lead study, the 
statistical significance and magnitude of the latent exposure coefficient ob- 
tained via maximum likelihood estimation depends heavily on the assumed 
covariance structure for the MDI residuals (Table 3). This provides empir- 
ical evidence of the strong dependence of the regression coefficients on the 
covariance structure for the outcome residuals and related bias. In a simu- 
lation study we quantify the magnitude of the bias and show that it is not 
in a consistent direction. While methods that relax classic distributional as- 
sumptions (e.g., normality) on the error terms have been developed [Browne 
(1984); Arminger and Schoenberg (1989)], the more challenging problem of 
relaxing correct specification of residuals' covariance matrices remains an 
open problem. 

A related area of discussion regarding parameter estimation involves mod- 
els where surrogates of exposure can be modeled separately from the model 
relating the latent exposure to the health outcome. Joint estimation of the 
model for the exposure measurements and the regression model between the 
latent exposure is advocated by many. The primary arguments in favor of 
joint estimation are that it increases efficiency and eliminates possible bi- 
ases from two step approaches to estimation [Bartholomew (1981); Bollen 
(1989); Iwata (1992); Wall and Li (2003)]. However, joint estimation can 
make the models more susceptible to model misspecification [Hoogland and 
Boomsma (1998)]. In the fetal lead exposure example, we were particularly 
concerned about obtaining biased effects of lead exposure due to misspec- 
ified covariance assumptions on the longitudinal health outcome (Bayley's 
MDI); the estimated effect varies greatly depending on the covariance as- 
sumption. Hence, we sought an alternative estimation approach which is 
robust to misspecification, but does not lose too much efficiency. 

We propose an estimating equations approach for models with latent ex- 
posures and longitudinal outcomes that is robust to misspecification of the 
conditional variance of the longitudinal outcomes given the true exposure. 
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Table 3 

Estimated effects'' of latent trimester 1 exposure and postnatal exposure on child mental 
development based on the model from Figure 1; results from various fitting procedures 



Prenatal exposure Postnatal exposure 



Method 


Covariance structure 


Pi 


s.e. 


p- value 


^2 


s.e. 


p- value 


MLE 


Independence 




-2.4600 


0.5280 


0.000 


-0.7730 


0.5520 


0.081 




CS 




-1.0370 


0.5730 


0.035 


-0.8480 


0.6430 


0.094 




CSH 




-0.9420 


0.5390 


0.040 


-0.7500 


0.6010 


0.106 




Unstructured 




-0.6350 


0.4340 


0.072 


-0.2110 


0.4870 


0.333 


GEE 


Independence 


















(/3r,/32*) = (o,o) 




-1.0231 


0.5745 


0.037 


-0.9719 


0.5484 


0.038 




(/3l*,/32*) = (-1.0,- 


-0.8) 


-1.0290 


0.5729 


0.036 


-0.9623 


0.5462 


0.039 




{Pi, Pi) not fixed 




-1.0400 


0.5733 


0.035 


-0.9650 


0.5464 


0.039 


GEE 


Exchangeable 


















(/3r,/32*) = (0,0) 




-0.9924 


0.5604 


0.038 


-0.8767 


0.5535 


0.057 




(/3i*,/32*) = (-1.0,- 


-0.8) 


-0.9941 


0.5598 


0.038 


-0.8733 


0.5526 


0.057 




(A*,/?!) not fixed 




-0.9967 


0.5599 


0.038 


-0.8757 


0.5527 


0.057 


GEE 


Exchangeablett 






















-1.1021 


0.4911 


0.012 


-0.1389 


0.1074 


0.098 




PI = -1.0 




-1.1016 


0.4908 


0.012 


-0.1392 


0.1073 


0.097 


^Adjusted for maternal age 


and IQ, child's 


gender, 


child's aj 


^e using i 


ndicator variables 



for each time point and gender by time interactions. 

tt Model uses observed child's blood lead concentration as measure of postnatal exposure. 

The proposed approach can be viewed as a generahzation of regression cal- 
ibration [Carroll et al. (2006)], and formalizes an ad-hoc procedure pop- 
ular among psychometricians, namely, regression on factor scores [Tucker 
(1971); Skrondal and Laake (2001)]. It loses little efficiency compared to 
maximum likelihood estimation when model assumptions are met. In addi- 
tion, we show that, under certain study designs, our proposed estimating 
equations approach can yield estimates that are more efficient than those 
from regression calibration. The proposed approach can easily accommodate 
studies with many patterns of missing data among the exposure measure- 
ments. 

The organization of this paper is as follows. In Section 2 we present more 
details about the study of in-utero lead exposure and neurodevelopment in 
children, propose a latent variable model for the data, and use the example 
to introduce many of the latent variable concepts. In Section 3 we develop 
a more general model for longitudinal responses where predictors of interest 
are latent. This more general specification allows us to more succinctly de- 
scribe estimation methods, and bias and efficiency issues in Sections 4 and 
5, respectively. Specifically, in Section 4 we review maximum likelihood es- 
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timation, introduce the robust estimating equations approach, and discuss 
how the proposed approach generahzes factor-score regression and regression 
cahbration. Section 5 compares maximum hkehhood estimation, regression 
cahbration and the proposed approach in terms of bias, mean squared error 
and variance of the estimated exposure effect. Finally, Section 6 discusses the 
conclusions and implications of the estimation approach proposed herein. 

2. Latent variable model for latent lead exposure, and neurodevelopment. 

Given the large number of exposure measurements available, we develop a 
way to synthesize the exposure information using a latent variable model. 
We think of the overall model as two separate pieces: a model for the ex- 
posures, "the exposure model," and a model that links the latent exposures 
to the mental development outcome, "the outcome model." In synthesizing 
the exposure information, we take into account the longitudinal aspect of 
the exposures, as well as the biological relationships between the various 
exposure biomarkers. The model is detailed below algebraically, and is also 
shown graphically in Figure 1. 

The exposure part of the model takes into account the longitudinal as- 
sessment of the exposure biomarkers by positing the existence of a latent ex- 
posure at each trimester of pregnancy. The latent exposure at each trimester 
is assumed to be indirectly measured by plasma and blood lead concentra- 
tions. That is, letting Uu represent the latent exposure for individual i at 
each trimester of pregnancy t, t = 0,l,2,3 (t = refers to before pregnancy), 
we propose 

Xiit = Uit + Silt, model for plasma lead, 

\ Xi2t = i^2t + ^2Uit + Si2t, model for blood lead (Laboratory 1), 



Xi43 = ^^43 + ^4:Ui3 + 6i43, model for cord blood, 

where (5's are zero-mean measurement errors. Given that Uus are not ob- 
served, they do not have a natural location and scale. In this model, however, 
the latent lead exposure is assumed to have the same mean and the same 
units as lead concentration in plasma. This information in conveyed by the 
equation Xm = Uit + Snt- The i^'s and A's in the other equations shift the lo- 
cation and translate units between the various measurements and the latent 
exposures. 

Latent variable models typically assume that surrogate measures X are 
conditionally independent of each other given the latent variables, for ex- 
ample, Ut- In other words, the covariance matrix for the 5's is assumed to 
be diagonal. However, after modeling the mean of the blood lead levels, cor- 
relations among the residuals within laboratory may exist. We model the 
correlation of the errors i ^ = 0, . . . , 3, across time as using an autocorrela- 
tion structure of order 1: cor(52o, ^21) = cor((^2i> (^22) = cor(522) (^23) = Pi, and 
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|BloodPB^#1T2f 
Blood P8, #2"t2| ' 

[ Plasma PB. T3 [ 
^[□lood PB. #1T3h 



« I Blood PB. #j TSj ^-'^^ 



Cord Blood PS 



MDI 12 



MDI IB 



MDI 24 




Blood PB 3Smo | 

Blood PB 30mo | 
Blood PB 24mo | 



Blood PB ISmo | 



Latent Variabtc 
] Krror Projie Measureinent 
Covariaie 



Fig. 1. Path diagram showing relationships between exposure surrogates, latent exposures 
and mental development index. Covariates are not shown in the figure. However, all la- 
tent variables and Bayley's MDI are regressed on maternal age. Additionally, bone lead 
burden is regressed on maternal percent of life in Mexico City; circulating lead levels are 
regressed on weekly leaded ceramic use; Bayley's MDI additionally regressed on mater- 
nal IQ and child's gender, indicator variables for measurement occasion, and gender by 
occasion interaction. 



similarly for (Jat , t = 0, . . . , 3. However, 621 is assumed independent from d^t' 
for all time points t,t' (different laboratories). In Figure 1 these correlations 
are depicted with double headed arrows between the respective measures. 

Bone lead concentration was measured from the patella and tibia bones 
before and after pregnancy, for a total of up to four measures of bone lead 
from each mother (two before conception and two more at one month post 
partum), and up to three measures of bone resorption rates were collected. 
Bone resorption rates, measured at trimester t = 1,2,3, are modeled using 
Xt4 = C/4 + kit + et4, where U4 represents the mother's intrinsic resorption 
rates, and ki is a fixed time effect. The model for bone lead concentrations 
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IS 




Ui5 + 6ii5, 

1^45 + ^2bUih + (5i45) 



patella lead, before pregnancy, 
tibia lead, before pregnancy, 
patella lead, 1 mo. post partum, 
tibia lead, 1 mo. post partum. 



which assumes that bone lead burden, C/5, is measured in units of patella 
lead concentration. The population average for bone lead burden is assumed 
to be equal to the average observed for patella lead before pregnancy, and 
changes by 2^35 after pregnancy. 

The next step in defining the exposure model is to relate latent variables 
to each other and to covariates which may predict the latent exposure. The 
purpose of this model is to be able to borrow exposure information across 
individuals who may have similar covariates or exposure levels. This is help- 
ful in characterizing first trimester fetal exposure when mothers were not 
interviewed at the earlier trimesters of pregnancy. 

Circulating lead levels change with time and are affected by maternal 
lead burden and the rate at which bone resorbs. Thus, we model them as 
Ut = ao+7i,it4 + 7i,2f/5+72,is(i) + 72,2W^i+Ci, where s{t) =t, for t = 0, 1,2, 
and s(3) = 2; Wi is use of contaminated ceramics, and var[(^0) • • • > ^3)"^] is 
unstructured to avoid possible residual error covariance misspecification in 
the exposure model. The effects of bone resorption rates, C/4, and bone lead 
burden, C/5, on circulating lead are given by 71^1 and 71^2, respectively. Fi- 
nally, we model the effects of covariates on bone lead burden and bone 
resorption rates: C/4 = 04 -|- 72,3^^2 + ^4 and C/5 = 05 + 72,4^^2 + 72,5^3 + Cs, 
where W2,W3 are maternal age and percent of life lived in Mexico City. 
Increases in maternal age are associated with increased bone lead concen- 
tration and bone resorption, while larger percent of life lived in Mexico City 
(i.e., longer exposure time) is associated with increased bone lead burden. 
With this we have arrived at a model for prenatal exposures. 

The postnatal environment can also be described using a latent variable 
assumed to be indirectly observed through the child's blood lead measure- 
ments at the time of the MDI assessments. Denoting j as the measurement 
occasion for the child's MDI and blood lead, the model for XijQ, the blood 
lead concentration at the jth occasion, is XijQ = vjq + XjeUiQ + 5ijQ, where 
UiQ represents postnatal exposure. We fix the location and scale of the post- 
natal exposure variable to that of Xhq. In Figure 1 the postnatal exposure 
variable is assumed to be independent from prenatal circulating lead expo- 
sure (no double headed arrows connecting them). This assumption slightly 
simplifies the figure, and has little effect on the estimated parameters of 
interest (e.g., <10% change in the parameters in Table 3). 

Finally, we arrive at the "outcome model" linking the exposure model to 
the longitudinal health outcome. Denoting Yij as the jth outcome for the ith 
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child, the longitudinal outcomes model is Yij = f3oj +PiUii + P2Uie + KfZij + 
Eij, where Zij is mother's age and IQ, child's gender, and occasion by gender 
interactions. The coefficient /3i represents the association between latent first 
trimester exposure and child development. The contribution of the postnatal 
environment to MDI scores is modeled by /32f^i6j alternatively, UiQ could be 
replaced by the observed blood lead concentration at the jth occasion XijQ. 
For direct comparability of the estimated P2, in this alternative model we 
use P2X^jQ, where Xf^Q is a version of XijQ scaled to have variance equal 
to var(Xji6). Residuals Si = (ea, • • • , EmJ are assumed to be independent 
of Uii, UiQ and Zij, but can have various types of correlation structures 
among themselves. In Figure 1 the variance matrix for is assumed to be 
diagonal, since there are no correlation arrows (double-headed) connecting 
the outcomes Yij to each other. We explore the consequences of misspecifying 
this covariance assumption. 

In the next sections we will discuss the details of the estimating ap- 
proaches for this type of model, but at this time we give estimates of the 
regression coefficients obtained under various approaches. Table 3 shows the 
parameter estimates using maximum likelihood estimation (MLE) for model 
parameters under various choices of the correlation matrix for Sj. The MLE 
estimate of Pi more than doubles when the assumed correlation structure 
for the outcome is independence in comparison to compound symmetry. 
In some instances the maximum likelihood estimator is highly significant 
{p < 0.001), and in others not significant {p = 0.07). Variability in the esti- 
mated effect of postnatal exposure is also observed, although not to the same 
degree. The variability in the estimated exposure coefficients and in the in- 
ferences obtained under different covariance structures raises concern about 
the robustness of the MLE approach. Table 3 also shows estimates obtained 
via estimating equations approaches and regression calibration described in 
later sections. For prenatal exposure, the estimates and p-values from these 
alternative approaches are stable at about 1.0 point decline in MDI for a 
doubling of plasma lead concentration, and p = 0.037, respectively. 

Estimates of other parameters, for example, maternal IQ and age coeffi- 
cients, are stable across the MLE and estimating equation approaches. This 
supports the fact that MLE estimation gives biased estimates only for the 
latent variable coefficients when the outcome's covariance structure is mis- 
specified (Section 4.1). The estimated increase in MDI is 0.54 points (95% 
CI: 0.01, 1.1) for a 5-year increase in maternal age and 0.67 points (95% CI: 
0.35, 1.02) for a 10 point increase in maternal IQ. The estimated association 
between MDI and gender is also stable across estimation approaches, al- 
though it significantly varies across measurement occasion (p < 0.001). The 
largest gender difference occurs at 24 months, where girls score an average 
of 4.8 points higher than boys (95% CI: 2.2, 7.5). 
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Table 3 also shows the results from modeling the postnatal environment 
by replacing C/jg by the observed child's blood lead concentration at the jth 
occasion. As can be seen, the parameter estimate is greatly attenuated to- 
ward zero, as would be expected under the assumption that child's blood 
lead is a surrogate measure (with error) of the postnatal environment [Car- 
roll et al. (2006)]. However, the association with maternal age also changes 
by nearly 10% to 0.60 (95% CI: 0.04, 1.16), as might be expected given that 
the effect of the measurement error in one variable may extend to other 
covariates [Budtz-j0rgensen et al. (2003)]. 

3. More general framework: latent exposures model with longitudinal 
responses. Before we discuss the estimation approaches used to arrive at 
the estimates in Table 3, we write a more general framework for the model 
discussed in the previous section. Let Yi = {Yii,Yi2, ■ ■ ■ , ^mj^ represent a 
vector of rij continuous responses taken on the ith of N subjects at occa- 
sions j = l,2,...,nj. These responses may depend on I latent exposures 
Ui = (Uii, . . . ,Uii)^ , which are indirectly measured by p observed surro- 
gates Xi = {Xii, . . . , Xip)'^ . In the example / = 6, but we were interested 
only in the effect of Un and UiQ on Yi. Covariate data are denoted by 
Wi = {Wii, . . . ,Wir)'^ and Zi = {Zn , . . . , Zirn)'^ . Covariates Zij are q x 1 
vectors assumed to be measured concurrently with the jth repeated out- 
come, Yij, and may include variables to model time effects. Covariates Wi 
are ascertained concurrently with Xi. 

The exposure model can then be succinctly written as 

(2) Xi = u + AUi + KWi + 6i, 

with the latent predictors Ui depending on fixed covariates and other latent 
variables 

(3) U^ = a + ^lUi + T2W^+C^, 

and the outcome model as 

(4) =po+P^U^ + J Zij + e,j . 

In (2) the (p x 1) vector of errors 6i represents measurement variability, in- 
cluding measurement error; we assume E{5i\Ui, Zi,Wi) = and 
vaT{5i\Ui, Zi,Wi) = ^s- In (3) Fi is an / x Z matrix with all diagonal ele- 
ments equal to zero, and we assume that (// — Ti) is invertible, where Ii 
is an identity matrix of the same dimension. In the lead example, the only 
nonzero entries of Ti are those corresponding to 71^1 and 71^2, the effects of 
bone lead concentration and bone resorption rate on circulating lead levels. 
We suppose E{£^i\Zi, Wi) = and var(,^j|Zj, Wi) = ^, and that S,i is indepen- 
dent from 5i. The dimensions of the parameter matrices A2 and K are 
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p X I, p X I and px r, respectively; and those of a and r2 are I x 1 and / x r, 
respectively. 

In the outcome model, E{ei\Ui, Zi,Wi) = 0, and vav{ei\Ui, Zi,Wi) = Oe-, 
where the x rij matrix Q^. can vary with i [e.g., for random effects with 
variance A, rjg. = ZiAZj" + (^"^lui, where /„. represents an rij x rij identity 
matrix. Laird and Ware (1982)]; in the sequel we may drop the subscript i 
in this matrix for ease of exposition. For notational convenience, we assume 
that if any covariates Wi are to be included in the outcome model, then 
they are also included in Zj. Further, it is assumed that £i, 6i and S^i are 
mutually independent; typically these errors are also assumed to be normally 
distributed. Finally, as it will be further discussed in Section 4, we allow 
subjects to have missing data on some of the surrogate measurements, Xi. 
We hereafter refer to {(3o, P'^ , k^) as conditional mean parameters, and note 
that P is the parameter of primary interest. 

Extensive discussions on the interpretation and identifiability of other 
parameters in (2)-(4) appear in the literature [e.g., Bollen (1989); Skrondal 
and Rabe-Hesketh (2004); Sanchez et al. (2005)], thus, we only make a few 
remarks. First, note that the equation describing the relationship between 
surrogates and latent variables (2) also includes a covariate term. The matrix 
K is typically sparse, with a few nonzero elements that allow for item bias 
[Beck (1982)]. Item bias consists of differential effects of covariates on par- 
ticular surrogates that go beyond the effect of the covariate on the (parent) 
latent variable. In psychometrics, where these models have enjoyed much 
use, a typical example of item bias is differential responses by gender on a 
specific item (surrogate). In the context of the lead example, the surrogates 
are biomarkers of exposure, such that demographic characteristics may not 
necessarily affect a particular surrogate differently from other surrogates. 
However, one possible item bias covariate might be a genetic variant. For 
instance, it is hypothesized that the ALAD genotype changes the affinity 
of lead to red blood cells [Bergdahl et al. (1998)]. Thus, people with the 
ALAD variant would have different average lead levels in red blood cells 
compared to ALAD-wildtypes, but the concentrations of plasma lead may 
not necessarily differ. 

Second, the matrix A will typically have a pre-defined pattern of zeroes 
and ones to reflect knowledge about the relationships between surrogates 
and latent variables and defining the scales of the latent variables. Given 
the patterns of zeroes, (2) can be thought of as confirmatory factor analy- 
sis, which avoids the identifiability concerns typically associated with factor 
analysis [Bollen (1989); Jolliffe (1998)]. 

4. Estimation. We now discuss maximum likelihood estimation and the 
proposed estimating equations approach using the more general framework 
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for the model discussed in Section 3. We write score equations for the model 
to study how invalid covariance assumptions induce bias in the estimate for 
j3. We then introduce estimating equations to eliminate the bias, generalize 
regression calibration, and formalize regression on factor scores. 

We begin by introducing notation common to all approaches. Let 9 de- 
note all model parameters. Because it will be helpful in efficiency calcula- 
tions later, we partition 6 into 9'^ = (0f , 6*2", ^J), where 9i = (/3o, 0^ , k^) are 
parameters for the conditional mean in the outcome model, 92 parameter- 
izes the variance matrix of the outcome given the latent exposure, that is, 
r^e = Oe(02)) and 03 is a vector containing the exposure model parameters, 
that is, those that parameterize i'. A, a, Fi, r2 and ^. 

The estimation approaches utilize several marginal and conditional mo- 
ments, which we now define. Given covariates Wi for individual i, the mean 
and variance of the latent exposure are /^^ = E{Ui\Zi,Wi) = (/ — ri)~^(Q + 
T2Wi) and = var{Ui\Zi,Wi) = {I - ri)~^^{I - Ti)-^ ; the mean for the 
surrogates is /i^ = E{Xi\Zi,Wi) = v + A^^ + KWi] and the marginal vari- 
ance of the surrogates is Q,x = var(Xj|Zj, Wi) = A^'^A^ + Q^. The conditional 
moments of the latent predictors, given the error prone measurements, are 

(5) Ui = E{Ui\X^,Zi, Wi) = + ^uA^^^HXi - ^^i), 

(6) ^u = yav{Ui\X,,Zi,Wi) = ^u- ^nA^fl-iA*„. 

Finally, for a subject with rii observed outcomes, the conditional moments 
of the outcome, given the error prone predictors, are functions of (5) and 

(6) : 

(7) //Jfi^ = E{Yij\Xi,Wi, Zi) = /3o + P^Ui + K^Zij, 

(8) = variY,\Xi,Wi, Zi) = + /3^$,/31„T^^, 

where 1„,- is a vector of ones of length For ease of exposition, in what 
follows we drop the subject index i, unless necessary. 

In cases where the surrogate vector, X , is not completely observed, (5)-(8) 
depend on the missing data pattern for the given individual. Such depen- 
dence could be denoted by a subscript (m) representing the mth missing 
data pattern among the X, for example, ^„(„^), ^J-y\x(m) and ^y\x{m)- 

In particular, note that the variance (8) of the outcome depends on the 
missing data pattern for the surrogates, X. This dependence will play a role 
in developing weights for the proposed estimating equations approach. 

4.1. Full maximum likelihood estimation. To estimate the model param- 
eters via maximum likelihood, consider the likelihood contribution of one 
subject, conditional on covariates, W: 



L{9) = f{Y, X\Z, W; 9) = f{Y\X, Z, W; 9^,92, 93)f{X\Z, W; 9^). 
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Assuming normality of e, 6 and ^, then f {Y\X, Z,W;9i, 62,93) 
Normal(//j^|2,, Qyla;), and M^; ^3) ~ Normal (^x, ^^x)- Thus, letting . 
log L(6'), letting subscript k denote the A:th element of /3, and letting Tr de- 
note the trace of a matrix, a subject's contribution to the likelihood score 
equations for 9i is given by 

d£ 

y\x\ 



BP 

Similarly, letting the subscripts 2k and 2>k denote the klh. element of 62 and 
^3 respectively, the contributions for the outcome variance and the exposure 
model parameters are 

. — _ ' I ( ) ^ ) ^ \ i V — II , \l V — II , \^ _l, , , 

'"y\x\ 



2k 



d£ 8 8 

■ log(/(y |X, Z, W; 91,02, 93)) + -— log(/(X|Z, W; ^3)). 



d93k d93k ^ ' ' ' ' ^' ^' d93k 

Under correct model specification, setting the score equal to zero can be 
used to obtain asymptotically unbiased parameter estimates, ^mle- The vari- 
ance of ^MLE can be calculated from the inverse of the information matrix, 
var(6'MLE) = [—Y^f=iE{d'^li/d9 89'^)]"^ . If normality assumptions are not 
satisfied, then robust standard error estimates can be computed [Arminger 
and Schoenberg (1989)]. 

In contrast to linear models for longitudinal data where covariates are 
measured without error [Laird and Ware (1982)], misspecification of the 
error covariance, Q.^, of the longitudinal outcome may induce bias in the 
estimate for (3. This is seen by examining the expected value of its score 
equation. The first term in di/dPk is zero in expectation if the conditional 
mean fj.y\^ = /3oln + P^Uln + f^^Z is correctly specified, which occurs when: 
(a) X are surrogates [i.e., f{Y\U,X,Z,W) = f{Y\U,Z,W), Carroh et ah 
(2006)]; (b) the variance for the error prone surrogates is correctly modeled 
[i.e., var{X\W, Z) = A^'tjA"^ + Qs]', and when (c) the variance for the la- 
tent predictors is correctly modeled, var(U\Z, W) = ^u- The second term in 
81/ 8 (3k is zero in expectation, assuming that the conditional variance of the 
outcome given the surrogates, ^y\x = ^e + P'^'^uP, has been correctly speci- 
fied. The conditional variance is correct only when the conditions above are 



16 



B. N. SANCHEZ, E. BUDTZ-J0RGENSEN AND L. M. RYAN 



met, in addition to (d) correctly modeling fig, the variance of the longitudi- 
nal responses given the true exposure. In Section 5 we assess the magnitude 
of bias in /3 induced due to misspecification of O^. 

4.2. Estimating equations approach. We formulate generalized estimat- 
ing equations [Liang and Zeger (1986)] that relax the dependence on correct 
specification of the conditional error variance structure, O^. The estimating 
equations are given by S = J2 = 0, where S = {Sj^,Sj^,Sj^)'^ has three 
components corresponding to 61,62,03. Each subject's contribution. Si, can 
be similarly partitioned. Letting the subscript k represent the kth element 
of a vector, and dropping the subscript i, an individual's contribution to S 
is given by 

Se,, = \ Tr{ - H\s){y - f^yl.f - ^yl.)}, 

Se,, = -^Jog{fiX\Z,W;6s)), 

where Ry\x is a working covariance matrix discussed later. The estimating 
equations relax the dependence of correct specification of 0^ by eliminating 
the second term in the score equations for /3. Further, note that Sg^ does 
not depend on 61 or O2, which makes the estimation of exposure model 
parameters, ^3, robust to outcome model misspecification. Also, it allows 
the equations for ^3 to be solved separately from those of 61 and 62. The 
estimates obtained from these estimating equations will be called See- 

The variance for the estimates obtained via the estimating equations can 
be calculated from the sandwich formula, Nvavld^^) = B~^AB~~^. The com- 
ponents of the formula, A = E{SS) and B = E{dS / dd'^), can be consistently 
estimated by their empirical analogs 

\ ^ 1 ^ 

J and A = -Y.iS^Sl)\,^-^. 

i=l 

As detailed in the supplementary materials [Sanchez, Budtz-j0rgensen 
and Ryan (2009a)], A and B have a block structure, which enables us to 
write the following variance formula for 61: 

(9) var(^i,EE) = B^lAuB^^ + SfiiSi3var(^3,EE)Sf3^ri'^- 



B 



E 



fdSi 



N^\d6'^ 
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The first term in (9), B^^ AnB^^ , is the asymptotic variance of 9i ob- 
tained from naively regressing Y on the estimated exposure, U (evaluated 
at ^3,ee)- The second term in (9) adjusts the variance of the naive estimator 
by accounting for the estimation of the exposure model parameters, and 
var(6'3,EE) =-633^^33533^. 

We study three alternate forms for the working covariance matrix, Ry^x- 

First, we set Ry^^ = ^y\x = + P'^^uP^^'^ , with evaluated at ^3. Except 
for the fact that the estimate of ^3 is slightly different, this working covari- 
ance is identical to the weight matrix of the score equations. Further, because 
Ry\x depends on subjects with missing data among the surrogates X 
are differentially weighted during the estimation. We call the estimates from 
this approach Peei- 

The other two working covariance matrices are Ry\x = and Ry\x = 
^£ + Pj^uP*^^'^ , where P^, is a fixed, "best guess" value of p. The former 
is a special case of the latter with P^ = 0. Neither of these matrices depend 
on the unknown exposure effect p. We will call estimates obtained using 
the working correlation Ry^x = + PT^uP*^^'^ with /3* 7^ as /3ee2- When 
P^: = 0, solving (5^ , Sj^)'^ = corresponds to fitting the outcome model with 

the true exposure U replaced by the estimate U (evaluated at 03,ee) and a 
working covariance that is independent of the exposure model. As detailed 
below, this last procedure is similar to regression on factor scores and regres- 
sion calibration [Carroll et al. (2006)]. This estimate will be denoted /3rc- 
The three estimating equation estimators under study are as follows: /3ee1, 
^be2 and ^Rc- 

4.3. Connections to regression on factor scores and regression calibration. 
The proposed estimation approach can be compared and contrasted to re- 
gression on factor scores, which has been widely used in the social sciences to 
estimate models with one or more latent variables. It consists of two steps. 
First, factor analysis is used to derive estimates of the latent variables called 
factor scores. At least two ways of estimating factor scores exist [e.g., Skro- 
ndal and Laake (2001)]. Tucker (1971) recommends empirical Bayes (EB) 
estimation of factor scores when they are to be used as predictor variables 
in a regression model. Thus, estimating the exposure model in (2)-(3), and 
using U evaluated at O-^^ee as the estimated factors completes the first step 
of regression on factor scores, with latent variables estimated by EB estima- 
tion. The second step is to use the estimated factors as predictors (in our 
case) or outcomes in a regression model. This would correspond to first es- 
timating ^3 from Sj, = 0, and then plugging in the resulting estimate when 
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solving for 9i and 62- Thus, estimating the exposure effect using regression 
on factor scores corresponds to obtaining /3rc- 

Factor score regression, however, has stopped short in that correct stan- 
dard errors for regression parameters are not computed. That is, the stan- 
dard errors for regression parameters in factor score regression do not typ- 
ically account for the estimation uncertainty of ^s.ee- Recently, resampling 
approaches were suggested as a means to correct such standard errors [Skro- 
ndal and Laake (2001)]. The proposed estimating equations formalize the 
ad-hoc regression on factor scores procedure by expressing the procedure 
in an estimating equations framework, and thereby allowing direct estima- 
tion of correct standard errors without the need for resampling techniques. 
Further, by introducing Ry^x as weights, weighed factor score regression is 
made possible, where observations are weighed according to the amount of 
exposure information available. 

Some researchers argue that regression parameters estimated via factor 
score regression are biased [e.g., Bollen (1989); Wall and Li (2003)] be- 
cause the estimated factors U are still error prone measures of the true 
latent variables. As documented by Skrondal and Laake (2001), the biases 
can be avoided by carefully choosing the way factor scores are estimated 
(e.g., EB estimation when latent variables are predictors). In the proposed 
framework, it is easy to see that Sp has expectation zero such that /3ee 
is, at least asymptotically, unbiased, assuming the exposure model is cor- 
rectly specified. Parallel arguments to those described in measurement error 
methodology can alsojae made for the unbiasedness of Pee- Namely, the 
measurement error of U is of Berkson-type [i.e., the error is independent of 
U: E{Ui - Ui\Ui,Zi,Wi) = E{Ui\Xi,Zi,Wi) - Ui = 0], which, in the linear 
model, inflates the standard error estimates for (3 but does not introduce 
bias [Fuller (1987); Carroll et al. (2006)]. 

The proposed estimating equations are philosophically similar to regres- 
sion calibration, although some differences exist. Regression calibration con- 
sists of regressing the outcome on the estimated exposure given only the sur- 
rogates and covariates [i.e., using Ui = E{Ui\Xi, Zi, Wi) as a predictor], and 
subsequently obtaining correct standard errors for the regression parame- 
ters. In the proposed approach, an estimated value of the latent exposure 
which depends only on the surrogates and covariates can be obtained by 
first using Sg^ = to solve for ^3 , and then calculating an estimated expo- 
sure using (5) evaluated at ^3. Next, similar to regression calibration, the 
estimated exposure can be plugged into Sg-^ = and Sg^ = to solve for 61 
and 62- In regression calibration, however, assumptions on the distribution 
of the surrogates (distribution of measurement error) are not made, other 
than having mean zero conditional on observed covariates. Exposure model 
parameters, ^3, are often estimated via method of moments. In contrast. 
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Se^ is derived from normality assumptions, which were necessary to easily 
account for surrogates missing at random [Little and Rubin (2002)]. That 
is, since exposure measurements X may often be missing not completely at 
random and many patterns of missing data are possible, we opt for max- 
imum likelihood as a method to estimate ^3. Last, regression calibration 
sets Ry\x = ^ei that is, it eliminates the dependence of Ry\x on the miss- 
ing pattern for X. Thus, regression calibration does not provide differential 
weighting for observations with more or less exposure information. 

4.4. A note on the practical implementation of estimation approaches. 
Many software packages are available to estimate latent variable models 
using maximum likelihood [e.g., Joreskog and Sorbom (1989); Muthen and 
Muthen (1998-2004); Rabe-Hesketh et al. (2004); Fox (2006)]. A few can 
accommodate estimation when data is missing at random, and/or when the 
variable's distribution is not normal [e.g., Muthen and Muthen (1998-2004); 
Rabe-Hesketh et al. (2004)]. A more complete software review is available 
[Rabe-Hesketh and Skrondal (2008)]. 

In the example and in the simulations presented in the next section we 
used the package Mplus to obtain maximum likelihood estimates. To obtain 
the estimating equation estimates we used a combination of Mplus [Muthen 
and Muthen (1998-2004)] and R. Because the estimating equations for ^3 
can be solved separately from those of 61,62, we first used Mplus to solve 
iSgg = 0. Estimates of U were obtained in R by importing the data and Mplus 
parameter estimates. Subsequently, we used R to solve Sq^ {63) = 0, {63) = 
0. Finally, corrected standard errors derived from (9) were obtained. All 
functions in R used to solve for parameter estimates and compute corrected 
standard errors are available as supplementary materials from the journal's 
website [Sanchez, Budtz-j0rgensen and Ryan (2009b)]. 

5. Comparing estimation approaches: bias, mean squared error and effi- 
ciency. 

5.1. Bias and mean squared error under a misspecified model. Via a sim- 
ulation study, we assessed the magnitude of the bias and mean squared error 
of the estimated exposure effect /3 when the conditional covariance struc- 
ture, Oe, is misspecified. Data were generated from a one latent variable 
model with longitudinal outcomes, where was designed to have a hetero- 
geneous autocorrelation structure with parameter p. We considered a range 
of values of /?, and various strengths of the correlation in the outcomes, 
yO = 0, 0.25, 0.5, 0.75 (p = allows us to assess the effect of incorrectly as- 
suming equal variance across time). Other parameters are detailed in the 
supplementary materials [Sanchez et al. (2009a)]. For each combination of 
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Fig. 2. Bias and Mean Square Error for various estimation approaches, under incorrect 
conditional covariance assumption for the longitudinal outcome. In the true model, fig has 
a heterogeneous autoregressive structure of order 1 with various strengths of the autocor- 
relation parameter p as shown in panel (a). Panels (a)-(c) and (d)-(e) show the resultant 
bias or MSE for /3mle, /3rc and /3eei under an incorrect independence and compound sym- 
metry assumption in the fitted model, respectively. For /3mle the bias dominates the bias; 
for I3rc and /3eei the bias is a negligible fraction of the MSE. 



P and p, two hundred data sets of = 500 were generated. An otherwise 
correctly specified model was then fitted to the simulated data, except that 
independence or compound symmetry, = I -\- a^ll^ , variance struc- 
tures were assumed. Data for the surrogates of exposure was complete. 
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Figure 2(a) displays the average bias for the exposure effect estimated 
via maximum likehhood when an independence variance structure is incor- 
rectly assumed. The bias in Pmle can be either positive or negative. The 
bias is positive when the true exposure effect is small, but is negative for 
larger effects. Correctly assuming conditional independence but incorrectly 
assuming homogenous variance results, in relative terms, in underestima- 
tion of the exposure effect of about 20% [Figure 2(a), short-dash curve]. In 
contrast, the bias for the regression calibration estimator and the proposed 
estimating equations (/3eei) is nearly zero in all scenarios (bias not shown). 
Figures 2(b) and (c) show the empirical mean squared error for these esti- 
mators, of which the bias is a negligible part. The bias for /3ee2 is similar to 
that of (3rc and Peei, independent of the value of (3^ (figure not shown). The 
mean squared error for the regression calibration estimate and that of Peei 
are approximately the same, and are much smaller than that of Pmle- 

Figures 2(d)-(f) show the bias or mean squared error resulting from incor- 
rectly assuming a compound symmetry structure. Here, the MLE estimates 
are biased toward the null, with the magnitude of the bias increasing with 
increasing magnitude of the true effect. In relative terms the bias is about 
20% irrespective of the true effect or the correlation parameter p. Again the 
bias in the estimates obtained via estimating equations is nearly zero, and 
the mean squared error for the regression calibration estimate was the same 
as that of /Seei- 

5.2. Theoretical relative efficiency under correct model specification: O^ei 
vs 9mle- We consider the efficiency of ^eei (i-e., using weights that de- 
pend on the unknown (3) relative to ^^le by studying the information ma- 
trix of both estimators, and the form of the conditional variance = 
vai(Y\U, Z,W). As detailed in the supplementary materials [Sanchez et al. 
(2009a)], the similarity in the form of the information matrices provides 
an insight as to when the estimating equations approach loses information 
compared to maximum likelihood. When 0,^ is parameterized with linear 
functions of 02 (e.g., compound symmetry, unstructured, banded), it can 
be shown that the estimating equations approach loses information only in 
estimating O^. However, because 6i is not independent of 63, the loss of in- 
formation in ^3 affects the asymptotic relative efficiency of 9i. When fl^ is 
parameterized with nonlinear functions of ^2 (e-g-j autoregressive structure), 
information is also lost for 9i, and higher losses in efficiency are expected. 

To quantify the loss of efficiency in the parameter of interest, /3, we eval- 
uated exact expressions for the relative efficiency of Peei compared to Pmle, 
again using a model with one latent exposure and longitudinal outcomes as 
in Section 5.1. We considered a compound symmetry structure for e's vari- 
ance, Q,e = c^/ + cr^ll"^, in addition to an autoregressive structure of order 
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one. In the calculations we allowed [3 to vary over a range of values, and 
considered three scenarios of the magnitude of the surrogates' measurement 
error variances. The specific parameter values for the simulations are given 
in the supplementary materials [Sanchez et al. (2009a)]. 

Figure 3 shows the results of the efficiency calculations, and illustrates the 
dependency of the relative efficiency on several parameters. First, the loss of 
efficiency in /3eei is fairly small for small values of the exposure effect, but 
increases with larger values of (3. This increase can be explained by the fact 
that when the exposure effect (3 increases, the outcome model will hold more 
information about the exposure model parameters ^3. This information is 
utilized in MLE, but not in the estimating equations. The loss of information 
in ^3 affects the efficiency of (3 because these parameters are not indepen- 
dent. Efficiency also depends on the structure of = var(y |i7, Z, W). Under 
compound symmetry. Figure 3(a)-(c), the relative efficiency does not exceed 
1.06. In contrast, the loss of efficiency under an autoregressive structure for 
the can range up to 25% for large effects, Figure 3(d)~(f). The larger 
loss of efficiency under the autoregressive structure is to be expected given 
that the covariance matrix is not linear in p, as previously noted. Under 
a given correlation structure, the loss of efficiency decreases with stronger 
correlations in the outcome. Finally, the loss of efficiency increases when the 
surrogates' measurement error variances increase. In applications this may 
be relevant when selecting between maximum likelihood and the estimating 
equations approach in the presence of poor exposure surrogates. 

5.3. Variance ratios under correct model specification: /?ee2 o,nd /?rc vs 
/5eei- We compared the variance of /5ee2 and /3rc to that of /3eei; that is, 
we evaluated the effect of replacing Ry\^ = fig + /3-^^„/?l„l^, with Ry\^ = 

^£ + /3r^«/^*lnln ) where /3* is a fixed constant (including zero). To calculate 
variance ratios, we simulated data from a model with one latent exposure, 
similar to that described in Section 5.1. We considered, in addition, a larger 
number of exposure surrogates p = 12, and allow surrogate data to be miss- 
ing completely at random under various scenarios. In the first scenario, miss- 
ing data patterns were equally likely. In a second scenario the probability of 
a missing data pattern was proportional to the theoretical variance of the 
latent variable given the observed pattern of surrogates, var(C/|X(„), Z, W). 
In the third, the probability of a missing data pattern was inversely propor- 
tional to var(f7|X(m), Z, M^). Hence, in the first scenario the distribution of 
the values of var(?7|X(^), Z, W) have a uniform distribution, whereas in the 
second and third, these variances were skewed to the left and right, respec- 
tively. For each of the three values of /3 = 0.5, 1, 2, one thousand data sets of 
A'" = 1000 observations were simulated. For each data set, we obtained /3ee1i 
and for a range of values of fixed we obtained /3ee2 and fi^c iP* = 0). The 
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Fig. 3. Relative efficiency of maximum likelihood estimate ((3ule) compared to estimat- 
ing equations approach (Peei) in the case of compound symmetry error structure with 
(a) al/{al + a^) = 0.25; (b) ai/{al + a^) = 0.50; and (c) al/{al + a^) = 0.75; and 
autoregressive error structure with (d) p = 0.25; (e) p — 0.50; and (f) p = 0.75. Plots 
show a range of values of the exposure effect (3 in standardized units [standardized(P) 
= var(C/)/\/o"2 + cj'i, ]. Curves represent varying degrees of measurement error in the 
surrogates, expressed as percentage of variability in measurement: 10% (solid), 2^o 
(dashed), 36% (dotted). 



empirical variances of the estimated effects were then calculated as well as 
their ratios. 
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Fig. 4. Figure compares the variance for /3ee2 to the variance of /3eei for three values 
of the true effect: /3 — 0.5 (solid), (3 = 1 (dashed), (3 = 2 (dotted). The ratio is shown as a 
function of the fixed value of (3 used to obtain /3ee2 • Note that Fixed (3 = corresponds to 
he 

Figure 4 shows the simulation results for the case when the values of 
var{U\X^:^-^, Z,W) have a distribution that is skewed to the left and the 
number of surrogates is 12. From the figure, we see that utilizing a fixed 
value of (3, /3*, in Ry\^ = + (3'^^uP*^n^n is practically as good as esti- 
mating P in the weights when /3* is close to the truth. Further, we see that 
efficiency gains of up to 10% are obtained by using a fixed value of /3* close 
to the truth in comparison to a regression calibration = 0). For the case 
when the missingness probability was equally likely or inversely proportional 
to var(C/|X(„), Z, M^), the maximum variance ratio observed from a figure 
similar to Figure 4 was more modest at 2%. Similar patterns were observed 
when the number of surrogates was p = 3. 

5.4. Remarks on fetal lead exposure example. Substantial differences in 
the estimated variances of the exposure effect were not observed for the dif- 
ferent estimating equations estimates (Table 2). A reason is that although 
there is a large number of missing data patterns for the exposure measure- 
ments, most subjects have enough surrogates to make the variance of the 
estimated exposure relatively small and similar across subjects. Further, al- 
though the estimated exposure effect is a 1.0 point decline in MDI scores 
for a doubling in plasma lead concentration at trimester 1, in standardized 
units the effect is roughly 0.19. Thus, given the results of our simulation 
study, considerable differences in efficiency among the estimating equations 
approaches are not expected. 
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6. Conclusions. We used generalized estimating equations to estimate 
parameters in a model for longitudinal responses where the predictors are 
latent. We were motivated by large potential biases in the maximum likeli- 
hood estimate when the conditional variance of the outcome given the latent 
exposure is misspecified. The estimating equations approach relaxes the as- 
sumption of correct conditional variance specification for the longitudinal 
outcomes. When the model is correctly specified, the loss of efficiency of the 
estimating equations is small for small effect sizes. When weighed estimating 
equations are used and the exposure effect estimated as part of the weights, 
the loss of efficiency is negligible. When the exposure measurements (surro- 
gates) are missing, the relative efficiency depends on the distribution of the 
conditional variance of the latent exposure given observed surrogates (these 
variances depend on the missing data pattern). Our estimation approach 
encompasses regression calibration and regression on factor scores as special 
cases. 

The estimating equations approach enables us to show that regression 
on factor scores yields asymptotically unbiased effect estimates when the 
exposure model is correctly specified, and allows us to provide a conceptu- 
ally simple way of computing correct standard errors for the outcome model 
parameters without the need for resampling approaches. In the setting of fac- 
tor score regression, Skrondal and Laake (2001) and Tucker (1971) discussed 
a sufficient requirement to obtain unbiased regression parameters, namely, 
estimating the factor scores using an Empirical Bayes approach we also dis- 
cussed when latent variables are predictors. However, they did not provide 
a simple way of conducting inference on the regression parameters. Further, 
they did not consider studies with longitudinal outcomes, which have the 
additional subtlety of misspecified residual error covariance structures. 

Sample size is an important consideration when justifying the use of com- 
plex latent variable models. Rules of thumb in the literature recommend any- 
where between five to twenty times cases as there are variables in the model 
to ensure stable parameter estimates [Bentler and Chou (1987); Stevens 
(1996)]. To fit the exposure model in the example, at least 260 observations 
would be needed. Sample sizes may also impact the bias and variance com- 
parison results in this paper. While a full evaluation of the impact of sample 
size on our results is beyond the scope of this paper, we conducted some 
exploratory simulations with sample sizes as low as 250. At this reduced 
sample size, the bias of the MLE was as strong as that portrayed in Figure 
2, while the bias of the estimating equations remained very small. The ef- 
ficiency loss in Peei compared to /?mle also remained low, for example, the 
highest value in a figure similar to Figure 3(c) was 1.03. Finally, the variance 
ratio comparing /3ac to Peei was 1.12 when the true effect was 2 (i.e., about 
the same value as the highest point in Figure 4). Further examination of 



26 



B. N. SANCHEZ, E. BUDTZ-J0RGENSEN AND L. M. RYAN 



sample size issues may be warranted, particulary for cases of small sample 
size and missing data among surrogates. 

Practical advantages and disadvantages of the proposed method are as 
follows. The estimating approach enables the practice of reusing stored esti- 
mates of latent exposure values in analyses of various health outcomes. This 
is advantageous in large epidemiological studies where testing the effects of 
a latent exposure on several health outcomes might be of interest. In such 
scenarios, correctly estimating effect's variances requires computing deriva- 
tives of the estimating equations for each outcome's mean parameters (i.e., 
obtaining B13). Automated procedures for this computation can be easily 
implemented. However, separately estimating the exposure model might give 
rise to identifiability problems in cases where there are only two surrogates 
for a latent variable. Such identifiability problems might not arise with full 
maximum likelihood estimation. Furthermore, this approach is confined to 
continuous outcomes. We do not expect the unbiasedness of the approach 
to hold for other outcomes, for which modifications may be needed [Carroll 
et al. (2006)]. Last, the approach still requires correct specification of the 
conditional error variance structure in the exposure part of the model. The 
more difficult problem of relaxing correct specification of the exposure model 
variance structure remains to be studied. 

SUPPLEMENTARY MATERIAL 

Supplement A: Supplement to "An estimating equations approach 
to fitting latent exposure models with longitudinal responses" 

(DOI: 10.1214/08-AOAS226SUPPA; .pdf). We provide details on the sim- 
ulation parameters referenced in Sections 5.1, 5.2 and 5.3. We also provide 
details on variance and relative efficiency calculations mentioned in Sections 
4.2 and 5.2, respectively. 

Supplement B: Computer code supplement to "An estimating equations 
approach to fitting latent exposure models with longitudinal responses" 

(DOI: 10.1214/08-AOAS226SUPPB; .zip). This zipped folder contains sev- 
eral files with example code for the procedures described in this article. 
For specific details on the files, read the "readme.txt" file found within this 
zipped folder. 
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